Abstract: The aim of this note is to propose a definition of scientific diversity
and, as a corollary, a measure of the "interdisciplinarity" of collaborations. In contrast
to previous studies, the proposed approach consists of two steps: first, a similarity
between journals is defined; second, these similarities are used to characterize the
homogeneity (or, on the contrary, the diversity) of a publication list (which can be that
of an individual or of a team).

1 Introduction 

Interdisciplinarity is nowadays of interest, for several reasons, to many people
and institutions. Let us just mention two recent initiatives in France: the creation of the
"mission interdisciplinaire" at CNRS (http://www.cnrs.fr/mi/) and the report
of the AERES, which proposes interesting directions for the evaluation of
interdisciplinarity [1] based on qualitative analysis. We do not intend to discuss the
reasons for such interest and refer to [2] for a detailed and recent review of
interdisciplinarity.

In this note, we propose a method for quantifying interdisciplinarity based only
on bibliometric data, without any a priori classification of scientific domains and/or
arbitrary knowledge of their proximity. The results obtained should be compared with
existing classifications and analysed by scientists to validate (or not!) their
relevance.



The present note is a very preliminary description of the idea, which has not been
tested on bibliometric data. The author is not an expert in scientometrics and does not
have access to the large databases that are necessary to test the approach. Many studies
have been devoted to co-authorship (see e.g. [3] and the references cited therein, at the
bottom of page 159). However, to my knowledge, the present approach (with two steps, as
detailed below) has not yet been proposed. In its present version, this note is not
intended for publication, and suggestions are warmly welcome, in particular pointers to
previous works in the same spirit.

The goal is to define a measure of the interdisciplinarity within a publication list
(for one individual, team, laboratory or institution). Such quantitative information has
to be complemented by a finer analysis by scientists to determine the relevance of the
corresponding scientific collaborations. We only try to propose an approach, to assess
its feasibility and, hopefully, to prove its capability to characterize
interdisciplinary studies.

The proposed approach is based on two steps. First, we define, from a
bibliographical database, a measure of the similarity between scientific journals based
on co-authorship, i.e. the more authors two journals have in common, the closer they are
(this will be made more precise later). It can be objected that we do not measure
scientific "proximity" but actual publication practices. The second step consists
in using these similarities to characterize whether or not a publication list is a
scientifically "homogeneous" set.

We believe that information on co-authorship is more reliable than citations for
evaluating pluridisciplinary collaborations. Indeed, it is rather common that a paper,
e.g. in mathematics, cites several articles from an application domain to illustrate the
origin of the scientific problem or to justify the modelling choices, while the core of
the paper is entirely focused on mathematical analysis. On the other hand, signing a
paper with colleagues means (we hope so) a mutual interest and joint work within the paper.

Note also that publishing a paper in a so-called pluridisciplinary journal (what is
the definition of such journals?) does not mean that the article is itself the result of a
collaboration between several scientific domains.

It can be argued that the proposed approach rather measures the originality of a set
of publications, which is the reason why it is called scientific diversity: the similarity
accounts for existing collaborations, even if they are already interdisciplinary.

Several web sites indeed provide information on scientific collaborations, such as
ResearchGate, ResearcherID, Google Scholar and ScienceWatch (non-exhaustive list), and
may be interested in providing new services and information to their visitors (see the
remainder of this note for examples).

The proposed analysis can also be of interest to editors of scientific journals (they
may already have similar tools, but these are not known to the author). The method is
presented in an algorithmic way in order to facilitate its implementation. Let us repeat
that feedback from experiments is welcome.

2 Data, notations

The method relies on bibliographical data (the larger the database, the more
relevant the information obtained).
Let us make the notations precise.

The database consists of a list of articles, each denoted by a unique identifier
(that can be considered as an integer) using the letter i. Each article i (where
$i \in \{1, \dots, N\}$) is described by

• the journal of publication, denoted by a unique identifier j. More precisely,
j(i) is the journal where article i has been published. The list of journals is finite
(even if its length increases with the creation of new journals each year). To fix
ideas, about 13000 journals are included in the Thomson-ISI database.

• y(i) is the year of publication of article i.

• K(i) is the list of (co-)authors of article i. The authors have to be identified,
i.e. each individual should have a unique identifier, which can be represented by an
integer. We will use k for authors. Thus, $k \in K(i)$ means that author k is
(one of) the author(s) of article i.

• p(i) is the number of pages of the article. This is useful to differentiate a short
note from a more detailed study, although this can be discussed. The interest of a
paper is, of course, not proportional to its length, but the length can be considered
a useful indicator once renormalized for a given journal (or a given author).

In this note, we use capital letters to represent lists. J is the (finite) list of all
the journals. I is the list of all the articles. For example, we denote by I(j) the list
of articles published in journal j, by I(j, k) the list of articles published by author
k in journal j, and by I(j, k, y) the list of articles published by author k in journal
j within year y. Similarly, J(k, y) represents the list of journals where author k
published in year y.

We denote by N the cardinality of a set, e.g. N(I(k)) is the total number of articles
by author k. P is the total number of pages, e.g. $P(j, y) = \sum_{i \in I(j, y)} p(i)$ is
the total number of pages in journal j during year y.
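
The notations above can be mirrored by a small data structure. The following Python sketch is only an illustration: the class name `Article`, the helpers `build_indexes` and `journals_of`, and the three toy records are assumptions made for this example and do not correspond to any existing database interface.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    ident: int      # i, unique identifier of the article
    journal: int    # j(i), identifier of the journal of publication
    year: int       # y(i), year of publication
    authors: tuple  # K(i), identifiers of the (co-)authors
    pages: int      # p(i), number of pages


def build_indexes(articles):
    """Return the lists I(j), I(k) and the page totals P(j, y)."""
    by_journal = defaultdict(list)            # I(j)
    by_author = defaultdict(list)             # I(k)
    pages_by_journal_year = defaultdict(int)  # P(j, y)
    for art in articles:
        by_journal[art.journal].append(art.ident)
        pages_by_journal_year[(art.journal, art.year)] += art.pages
        for k in art.authors:
            by_author[k].append(art.ident)
    return by_journal, by_author, pages_by_journal_year


def journals_of(articles, k, y):
    """J(k, y): set of journals where author k published in year y."""
    return {a.journal for a in articles if k in a.authors and a.year == y}


# Hypothetical toy database with three articles.
articles = [
    Article(1, journal=10, year=2010, authors=(1, 2), pages=12),
    Article(2, journal=10, year=2011, authors=(1,), pages=8),
    Article(3, journal=20, year=2011, authors=(2, 3), pages=20),
]
by_journal, by_author, pages_by_journal_year = build_indexes(articles)
```

With these indexes, N(I(k)) is simply `len(by_author[k])`. The indexes are built in one pass over the database, which matters if the method is to be tried on a large bibliographical base.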

Following the usual practice in the mathematical sciences (which is the
domain of the author), the weight of a given article will be shared uniformly between
all the authors of the article. This choice is naturally questionable, but so is the
implicit assumption that the importance of a paper is proportional to its number of
authors, as made e.g. in the computation of citations or impact factors. Let us refer
to [ ] for a discussion on the question of multiple authorship.



(Footnote on the identification of authors: this is the reason why I cannot test the
proposed approach on the data of the HAL French publication repository.)
3 Journal similarity



Using these notations, we shall now define the similarity between journals by
considering, for every article and every (co-)author of this article, all the other
articles by the same author.

More precisely, for all $i \in I$, all $k \in K(i)$ and all $i' \in I(k)$, the similarity
between journals j(i) and j(i') is incremented as follows:

$$ S(j(i), j(i')) \mathrel{+}= \min\left( \frac{p(i)}{N(K(i))}, \frac{p(i')}{N(K(i'))} \right). \qquad (1) $$

Note that $S(j(i'), j(i))$ will be increased by the same value (when exchanging the roles
of i and i'). We propose to increment the similarity between the two journals by the
minimum of the "weights" of the two articles (number of pages divided by the number of
authors) instead of using, e.g., the arithmetic mean, because we believe that the
scientific proximity is stronger if the two papers have the same weight (for the same
arithmetic mean). Other choices, for example a geometric mean ($\sqrt{ab}$), may give
better results. The only way to choose the right formula will be to test several choices
and compare the similarity matrices obtained (see below for some ideas to help in the
choice or in the validation of the relevant definition of similarity).
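
To make the accumulation rule concrete, here is a direct and unoptimized Python transcription, given as a sketch only. The tuple layout of the records, the function names `weight` and `journal_similarity`, and the toy data are assumptions of this example; the loop itself follows the rule stated above (for each article, each of its authors, and each article of that author, add the minimum of the two weights).

```python
from collections import defaultdict

# Hypothetical toy records: (article id, journal j(i), authors K(i), pages p(i)).
ARTICLES = [
    (1, "A", (1, 2), 12),
    (2, "A", (1,), 8),
    (3, "B", (2, 3), 20),
]


def weight(article):
    """Weight of an article: number of pages divided by number of authors."""
    _, _, authors, pages = article
    return pages / len(authors)


def journal_similarity(articles):
    """Accumulate S(j, j') over all triples (i, k, i') with k in K(i), i' in I(k)."""
    by_author = defaultdict(list)  # I(k)
    for art in articles:
        for k in art[2]:
            by_author[k].append(art)

    S = defaultdict(float)
    for art in articles:                 # all i in I
        for k in art[2]:                 # all k in K(i)
            for other in by_author[k]:   # all i' in I(k), including i itself
                S[(art[1], other[1])] += min(weight(art), weight(other))
    return S


S = journal_similarity(ARTICLES)
```

On this toy database the two journals share one author, and the resulting matrix is symmetric, as expected from the exchange of i and i'. Replacing `min` by another mean (arithmetic, geometric) only requires changing one line, which makes it easy to compare the variants discussed above.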

One can discuss the normalization of the page number, i.e. dividing the
number of pages of an article by the total number of pages within its journal. In
other words, we propose to replace p(i) in the above equation by

$$ \tilde{p}(i) = \frac{p(i)}{P(j(i))}. $$

These variants should be tested as soon as data are available. It is obvious that the
non-normalized choice will increase the impact of journals which produce a lot of
papers and/or pages, whereas the similarity should measure a proximity between journals
that should not be correlated with the "size" of the journal. Therefore, the
normalization by the total number of pages of the journal should be more pertinent.
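
The normalization of the page counts can be sketched as follows. This is an illustrative Python fragment under the same assumptions as before (hypothetical tuple records, hypothetical function name `normalized_pages`); it divides the number of pages of each article by the total number of pages of its journal over the whole toy database.

```python
from collections import defaultdict

# Hypothetical toy records: (article id, journal j(i), authors K(i), pages p(i)).
ARTICLES = [
    (1, "A", (1, 2), 12),
    (2, "A", (1,), 8),
    (3, "B", (2, 3), 20),
]


def normalized_pages(articles):
    """Map each article id i to p(i) / P(j(i))."""
    total = defaultdict(int)  # P(j): total number of pages of journal j
    for _, journal, _, pages in articles:
        total[journal] += pages
    return {ident: pages / total[journal]
            for ident, journal, _, pages in articles}


norm = normalized_pages(ARTICLES)
share_of_journal_a = norm[1] + norm[2]
```

By construction, the normalized page counts of the articles of a given journal sum to one, so a journal no longer gains similarity merely by publishing many pages.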

Remarks 

- By construction, $S(j, j) \geq 1$ since $i \in I(k)$ for all $k \in K(i)$ (or
$S(j, j) \geq P(j)$ if we do not use the normalized version). The actual value measures
how much the same authors concentrate their publications in journal j.

- Similarity can be computed for a given year (or period) by taking into account
only the articles of the corresponding year (or period).

- It is clear that the matrix S will be sparse and that it will be necessary to take into
account second-order co-authorship (i.e. the co-authors of co-authors, following the
idea of the Erdős number, ref?). Let us define $S_2$ by summing the binary interactions as
follows:

$$ S_2(j, j') = \sum_{j'' \in J} S(j, j'') \, S(j'', j'). $$

By construction, we see that $S_2 = S^2$. Then we can use $\tilde{S} = S + \delta S_2$,
where $\delta$ is a constant that represents the relative weight of secondary
co-authorship. Once again, this should be tested on a real / huge database (see below).
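
The second-order term is a plain matrix product of S with itself. The following Python sketch illustrates it on a sparse dictionary representation; the function names, the name `delta` for the relative weight of secondary co-authorship, and the toy values are assumptions of this example. For a real database, a sparse linear algebra library would probably be more appropriate than this pure-Python loop.

```python
def second_order(S, journals):
    """S2(j, j') = sum over j'' of S(j, j'') * S(j'', j')."""
    return {
        (a, b): sum(S.get((a, c), 0.0) * S.get((c, b), 0.0) for c in journals)
        for a in journals
        for b in journals
    }


def combined(S, journals, delta):
    """First-order similarity plus delta times the second-order term."""
    S2 = second_order(S, journals)
    return {key: S.get(key, 0.0) + delta * value for key, value in S2.items()}


# Hypothetical toy matrix: A and C share no author, but both are linked to B.
JOURNALS = ["A", "B", "C"]
S = {
    ("A", "A"): 1.0, ("B", "B"): 1.0, ("C", "C"): 1.0,
    ("A", "B"): 1.0, ("B", "A"): 1.0,
    ("B", "C"): 2.0, ("C", "B"): 2.0,
}
S2 = second_order(S, JOURNALS)
S_tilde = combined(S, JOURNALS, delta=0.5)
```

In this toy case the direct similarity between A and C is zero, while the second-order term is positive because both journals share authors with B; this is exactly the effect sought when the first-order matrix has too few nonzero entries.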

Validation phase 

At this stage, it will be necessary to test whether the proposed similarity fits with
the commonly used classifications by scientific domain. More precisely, it will be
interesting, using a given disciplinary classification, to verify whether or not the
averaged similarity inside a scientific domain is larger (and to what extent) than the
same average over all journals. The classification can also serve to compute the average
similarity between two chosen domains, by computing the average value of $S(j, j')$ for
any j in domain 1 and j' in domain 2. This will provide a similarity matrix between
scientific domains, and it has to be analysed whether it corresponds to the usual
classifications of scientific domains.
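
This validation step can be sketched in a few lines of Python. The classification `domains`, the journal names and the similarity values below are purely hypothetical, and excluding the pairs (j, j) from the averages is a choice made for this sketch, not something prescribed above.

```python
def symmetric(entries):
    """Build a symmetric similarity dictionary from one entry per pair."""
    S = {}
    for (a, b), value in entries.items():
        S[(a, b)] = value
        S[(b, a)] = value
    return S


def average_similarity(S, journals_1, journals_2):
    """Average of S(j, j') for j in journals_1, j' in journals_2, j != j'."""
    pairs = [(a, b) for a in journals_1 for b in journals_2 if a != b]
    if not pairs:
        return 0.0
    return sum(S.get(pair, 0.0) for pair in pairs) / len(pairs)


# Hypothetical classification and similarity values.
domains = {"math": ["M1", "M2"], "bio": ["B1", "B2"]}
S = symmetric({("M1", "M2"): 4.0, ("B1", "B2"): 6.0, ("M1", "B1"): 1.0})

within_math = average_similarity(S, domains["math"], domains["math"])
between = average_similarity(S, domains["math"], domains["bio"])
```

Applying `average_similarity` to every pair of domains gives the similarity matrix between scientific domains mentioned above, which can then be compared with the usual classifications.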

Note that, when considering articles from "multidisciplinary" journals (using an
arbitrary list/classification), each article is assigned to a domain according to the
citations in the article (see http://sciencewatch.com/about/met/classpapmultijour/).

Other studies, such as clustering, can be developed using this similarity between
journals [5, 6].

It can also be checked whether the "generalist" journals have a larger (average)
similarity than the more specific ones. One can e.g. use the 22 so-called broad fields
(see http://www.in-cites.com/journal-list/) in the "Essential Science
Indicators database" of Thomson-ISI.

Possible Services - Utilities 

Once validated, this similarity between journals can be of interest to editors who
wish to evaluate the impact of their editorial choices on the scientific positioning of
their journal. For example, if an effort is made to encourage papers in a neighbouring
domain (corresponding to a given subset of journals), one can check whether the average
similarity of the journal with those of the given subset increases with time (by
computing the averaged similarity restricted to successive years). It can also help
editors to see whether a journal evolves towards greater specialization or, on the
contrary, becomes more and more multidisciplinary.
4 Interdisciplinarity or scientific diversity index



Let us now consider that the matrix of similarity S is known (and validated). 

In this section, we will construct an index for any arbitrary list L of publications 
(that can be the one of a person, a team, a laboratory, an institution, a journal, an 
editor). 

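Let us define the so-called scientific diversity index SD of the list L as the averaged
similarity between the journals in the list, weighted by the respective weights of the
articles. In other words,

$$ SD(L) = \frac{1}{N(L)^2} \sum_{i \in L} \sum_{i' \in L} S(j(i), j(i')) \, \frac{p(i)}{N(K(i))} \, \frac{p(i')}{N(K(i'))}. \qquad (2) $$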

Note that the index is not related to the quantity of papers (if one duplicates the list,
the number of elements in the double sum is multiplied by 4 but N(L) is multiplied by 2,
and the value is unchanged).
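
As an illustration, the index can be computed as below. This Python sketch rests on an assumed reading of the definition: a double sum over pairs of articles of the list, each pair contributing the similarity of the two journals times the two article weights (pages divided by number of authors), the whole being divided by the square of the number of articles. The record layout, the function name and the toy values are hypothetical.

```python
def scientific_diversity(publications, S):
    """Weighted average similarity of a publication list.

    publications: list of (journal, pages, number_of_authors) tuples.
    S: dictionary mapping (journal, journal) pairs to similarities.
    """
    n = len(publications)
    total = 0.0
    for j1, p1, a1 in publications:
        for j2, p2, a2 in publications:
            total += S.get((j1, j2), 0.0) * (p1 / a1) * (p2 / a2)
    return total / n ** 2


# Hypothetical similarity matrix and publication list.
S = {("A", "A"): 2.0, ("B", "B"): 3.0, ("A", "B"): 1.0, ("B", "A"): 1.0}
L = [("A", 10, 2), ("B", 6, 3)]

sd = scientific_diversity(L, S)
sd_duplicated = scientific_diversity(L + L, S)
```

Under this reading, duplicating the list leaves the value unchanged, in agreement with the remark above: the double sum is multiplied by 4 and so is the square of the number of articles.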

The SD index must not be considered as an indicator of the quality of the articles
in list L; on the contrary, it is a qualitative indicator of this list of articles. Note
that this index is constructed using statistical / averaged bibliographical quantities. It
is therefore very questionable to use it on a small list of articles, and it is thus
likely more suitable for characterizing collective publication lists than those of
individuals, except for scientists with a publication list long enough for the result to
be significant (for such scientists, it will be interesting to see whether their SD is
correlated with their number of articles (I index, i.e. N(I(k)) with our notation) or
citations (h index, for example)).

Many studies can be carried out using this index, which, again, does not
measure the "quality", the "importance" or the "impact" (in terms of influence on other
scientists) but only the relative diversity, variety or originality of a list of
publications with respect to others.

Such an indicator must be used with great care, and only relative comparisons
make sense (the exact value has no interest). For example, for a given author k (whose
list is denoted by I(k) with our notations), one can compare the value of SD(I(k)) with
the corresponding values for all co-authors of k.

If such a web service is implemented, the author can be asked whether his/her proposed
ranking with respect to his/her co-authors in terms of "scientific diversity"
seems relevant to him/her or not. This would be, in my opinion, a good way to evaluate
whether the proposed indicator gives information that fits with the general opinion. If
some variants of the above definition are proposed, it can be checked which definition
better indicates the scientific diversity.
5 Central journal of a publication list



One can also use the similarity matrix to define, for any list L, the central journal by
looking for the journal in the list J(L) that maximises the similarity with the other
journals of the list:

$$ j_c(L) = \arg\max_{j \in J(L)} \sum_{j' \in J(L),\, j' \neq j} S(j, j'). \qquad (3) $$

In other words, if we interpret similarity as the inverse of a pseudo-distance, the
central journal is the one that minimizes the average distance to the other journals in
the list J(L). This may be related to the Fermat-Weber point or to the Fréchet mean.

It may be interesting to rank the journals of the list by decreasing value of their
averaged similarity with the others (as defined above). This should put at the top of the
list the journals that correspond to the principal domain of the author or list and, at
the end, the journals that are scientifically far from his/her speciality.
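
Both the central journal and the ranking can be obtained from the same averaged similarity. The Python sketch below is an illustration under stated assumptions: the function names and toy values are hypothetical, and the similarity of a journal with itself is excluded from the average, following the wording "with the others".

```python
def averaged_similarity(journal, journals, S):
    """Average similarity of `journal` with the other journals of the list."""
    others = [j for j in journals if j != journal]
    if not others:
        return 0.0
    return sum(S.get((journal, j), 0.0) for j in others) / len(others)


def rank_journals(journals, S):
    """Journals sorted by decreasing averaged similarity; the first is central."""
    return sorted(journals,
                  key=lambda j: averaged_similarity(j, journals, S),
                  reverse=True)


# Hypothetical symmetric similarities between the journals of a list J(L).
JOURNALS = ["A", "B", "C"]
S = {
    ("A", "B"): 5.0, ("B", "A"): 5.0,
    ("B", "C"): 3.0, ("C", "B"): 3.0,
    ("A", "C"): 1.0, ("C", "A"): 1.0,
}
ranking = rank_journals(JOURNALS, S)
central = ranking[0]
```

Here journal B is central because it is the most similar, on average, to the two others, and C comes last as the journal farthest from the rest of the list.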

As for the comparison with the scientific diversity (SD) of his/her co-authors, one
can think of asking, within such a web service, whether the ranking corresponds to what is
usually admitted (using a poll and, possibly, comparing the results of other definitions).

Another possible service that can be useful to scientists is to suggest journals
in which they have never published but which are "close", in the sense that the
similarity with their central journal is high (one can restrict the suggestions to
the list of journals where their co-authors have published). This can encourage them to
enlarge their list of journals and avoid scientific concentration.
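
Such a suggestion service could be sketched as follows. The function name `suggest_journals`, its parameters and the toy values are hypothetical; the candidate list stands for the journals where the co-authors have published, as suggested above.

```python
def suggest_journals(central, published, candidates, S, top=3):
    """Suggest unseen journals, most similar to the central journal first.

    central: central journal of the scientist's publication list.
    published: set of journals where the scientist has already published.
    candidates: journals to choose from (e.g. those of the co-authors).
    S: dictionary mapping (journal, journal) pairs to similarities.
    """
    unseen = [j for j in candidates if j not in published]
    unseen.sort(key=lambda j: S.get((central, j), 0.0), reverse=True)
    return unseen[:top]


# Hypothetical data: the scientist published in A and B, with central journal B.
S = {("B", "C"): 3.0, ("B", "D"): 7.0, ("B", "A"): 5.0}
suggestions = suggest_journals(
    central="B",
    published={"A", "B"},
    candidates=["A", "C", "D", "E"],
    S=S,
    top=2,
)
```

Journal A is discarded because the scientist has already published there, and E comes last since it has no recorded similarity with the central journal.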

It is also possible to give information about the evolution of their scientific
diversity over the years. Note that this definition of a "central journal" is only one
example of the use of the similarity index between journals, and many other concepts can
be proposed using tools from graph theory, network analysis, clustering...

Once again, feedback is welcome!
Mancini (U. Orleans), L. Cappelli (CCSD, Lyon), V. Miele (CNRS, Lyon)